A Additional HQA Results

Table 5: Additional CelebA interpolations of the HQA encoder output z

Neural Information Processing Systems

Compression is from 98,304 to 576 bits (171x compression). Compression is from 98,304 to 144 bits (683x compression). The far left and right images are originals.

B.1 Motivation

In this section we outline the probabilistic model that motivates the HQA loss: L = E_{q(z|x)}[log p(x | z = k)] + H[q(z|x)]. A desired property of the HQA, motivated in Section 4.4, is the non-deterministic posterior. We contrast these two models in Figure 8. This model is a Variational Autoencoder with a simple Mixture of Gaussians prior.
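The loss combines the decoder's expected log-likelihood under the discrete posterior with the posterior's entropy. A minimal numeric sketch of evaluating it for one input (the function name and toy numbers are ours, not from the paper):

```python
import numpy as np

def hqa_loss(log_px_given_z, q):
    """Evaluate L = E_{q(z|x)}[log p(x | z=k)] + H[q(z|x)] for a single
    input x with a discrete posterior over K codebook entries.

    log_px_given_z : (K,) decoder log-likelihoods log p(x | z = k)
    q              : (K,) posterior probabilities q(z = k | x), summing to 1
    """
    expected_loglik = np.dot(q, log_px_given_z)   # E_q[log p(x | z)]
    entropy = -np.sum(q * np.log(q + 1e-12))      # H[q(z | x)]
    return expected_loglik + entropy

# Toy example: 4 codebook entries, a peaked posterior.
log_px = np.array([-1.0, -2.0, -3.0, -4.0])
q = np.array([0.7, 0.1, 0.1, 0.1])
print(hqa_loss(log_px, q))
```

The entropy term rewards keeping the posterior non-deterministic, which is the desired property discussed above.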


Two Heads Are Better than One: Simulating Large Transformers with Small Ones

Yu, Hantao, Alman, Josh

arXiv.org Artificial Intelligence

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M \ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including average-case inputs, sliding window masking and attention sinks, the optimal number $O(N/M)$ of small transformers suffices.
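The counting can be illustrated with a tiling sketch: a length-N attention output is assembled from length-M chunks, with one small score computation per (query block, key block) pair, i.e. O((N/M)^2) small calls. This mirrors the paper's setting only in spirit; it is a standard log-sum-exp tiling, not the authors' construction:

```python
import numpy as np

def attention(Q, K, V):
    """Dense softmax attention, for reference."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, M):
    """Reproduce length-N attention using O((N/M)^2) computations that each
    only touch length-M chunks, merging partial results with a running
    log-sum-exp (max, normalizer, accumulator)."""
    N, d = Q.shape
    out = np.zeros_like(Q)
    for i in range(0, N, M):
        Qi = Q[i:i + M]
        m = np.full((Qi.shape[0], 1), -np.inf)   # running row max
        l = np.zeros((Qi.shape[0], 1))           # running normalizer
        acc = np.zeros((Qi.shape[0], d))         # running weighted sum
        for j in range(0, N, M):
            S = Qi @ K[j:j + M].T / np.sqrt(d)   # one "small" block of scores
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)
            scale = np.exp(m - m_new)            # rescale old partial sums
            l = l * scale + P.sum(axis=-1, keepdims=True)
            acc = acc * scale + P @ V[j:j + M]
            m = m_new
        out[i:i + M] = acc / l
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
err = np.abs(blockwise_attention(Q, K, V, M=2) - attention(Q, K, V)).max()
print(f"max deviation from dense attention: {err:.2e}")
```

Note this sketch tiles a single attention computation; the paper's stronger claim is that whole small *transformers* suffice, including the O(N/M) result under masking structures like sliding windows.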


xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Beck, Maximilian, Pöppel, Korbinian, Lippe, Phillip, Kurle, Richard, Blies, Patrick M., Klambauer, Günter, Böck, Sebastian, Hochreiter, Sepp

arXiv.org Artificial Intelligence

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
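The constant-memory property comes from replacing a growing key-value cache with a fixed-size recurrent state. The following is a generic matrix-memory recurrence of the kind mLSTM-style cells build on; it is a hedged sketch of the idea (gating and stabilization omitted), not the xLSTM 7B implementation:

```python
import numpy as np

def recurrent_readout(q_seq, k_seq, v_seq):
    """Linear-attention-style recurrence with a d x d matrix memory.
    Per step: S <- S + k v^T and z <- z + k, then y = S^T q / (z . q).
    The state (S, z) has fixed size regardless of sequence length, so
    per-token cost and memory are constant -- the property the abstract
    highlights for inference."""
    T, d = q_seq.shape
    S = np.zeros((d, d))   # matrix memory
    z = np.zeros(d)        # normalizer state
    ys = []
    for t in range(T):
        S += np.outer(k_seq[t], v_seq[t])
        z += k_seq[t]
        denom = max(abs(z @ q_seq[t]), 1e-6)
        ys.append((S.T @ q_seq[t]) / denom)
    return np.stack(ys)
```

With non-negative keys and queries this reproduces cumulative (unnormalized-kernel) linear attention exactly, which is why such recurrences scale linearly with sequence length.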


Metals can be squeezed into sheets just a few atoms thick

New Scientist

Sheets of metal just two atoms thick can be produced by squashing molten droplets at great pressure between two sapphires. The researchers who developed the process say the unusual materials could have applications in industrial chemistry, optics and computers. Last year, scientists created a gold sheet that was a single atom thick, which they dubbed "goldene" after graphene, a material made of a single layer of carbon atoms. Such materials have been described as two-dimensional, as they are as thin as chemically possible. But making other 2D metals hadn't been possible until now. The new technique, developed by Luojun Du at the Chinese Academy of Sciences and his colleagues, can create 2D sheets of bismuth, gallium, indium, tin and lead that are as thin as their atomic bonds allow.


Route Sparse Autoencoder to Interpret Large Language Models

Shi, Wei, Li, Sihang, Liang, Tao, Wan, Mingyang, Ma, Gojun, Wang, Xiang, He, Xiangnan

arXiv.org Artificial Intelligence

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
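The routing mechanism can be sketched as a learned convex combination over per-layer activations feeding one shared TopK encoder. All names, shapes, and the toy linear router below are our assumptions for illustration, not the authors' released code:

```python
import numpy as np

def route_and_encode(layer_acts, W_route, W_enc, b_enc, k=64):
    """Sketch of a RouteSAE-style forward pass for one token.

    layer_acts   : (L, d) residual-stream activations, one row per layer
    W_route      : (d,)   toy linear router scoring each layer's activation
    W_enc, b_enc : (d, F), (F,) shared SAE encoder across all layers
    k            : sparsity constraint (number of active features)
    """
    scores = layer_acts @ W_route              # (L,) one score per layer
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax routing weights
    x = w @ layer_acts                         # (d,) routed mixture of layers
    pre = x @ W_enc + b_enc                    # (F,) feature pre-activations
    f = np.maximum(pre, 0.0)
    if k < f.size:                             # TopK: keep k largest features
        thresh = np.partition(f, -k)[-k]
        f = np.where(f >= thresh, f, 0.0)
    return w, f
```

The router adds only a d-dimensional parameter vector here, illustrating why the overhead over a single-layer SAE can be minimal while still letting features attach to whichever layer activates them.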


Residual Stream Analysis with Multi-Layer SAEs

Lawson, Tim, Farnik, Lucy, Houghton, Conor, Aitchison, Laurence

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers.
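The training setup can be pictured by pooling activation vectors from every (layer, token) pair into one sample set for a single shared encoder; reshaping features back by layer then shows where a given feature fires. This is an illustrative data layout with random weights only, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual stream: L layers x T tokens x d dimensions. An MLSAE is one
# SAE trained on all of these vectors pooled together, so any feature is
# free to be active at any layer.
L, T, d, F = 4, 8, 16, 64
resid = rng.normal(size=(L, T, d))
samples = resid.reshape(L * T, d)         # (layer, token) pairs as samples

W_enc = rng.normal(size=(d, F)) / np.sqrt(d)
feats = np.maximum(samples @ W_enc, 0.0)  # one encoder shared across layers

# At which layers does feature 0 fire? Reshape back to (L, T) to inspect.
active = (feats[:, 0] > 0).reshape(L, T)
print(active.any(axis=1))                 # per-layer activity of one feature
```

In the trained setting, the paper's observation is that a feature's active layer varies across prompts but tends to be a single layer within one prompt; this per-layer view is exactly what the reshape exposes.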


What does self-attention learn from Masked Language Modelling?

Rende, Riccardo, Gerace, Federica, Laio, Alessandro, Goldt, Sebastian

arXiv.org Machine Learning

Transformers are neural networks which revolutionised natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalisation error of self-attention in a model scenario analytically using the replica method.
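The object a single self-attention layer is shown to learn is the site-wise conditional of a generalised Potts model, and the MLM training objective then coincides with the pseudo-likelihood. A direct sketch of both quantities (array shapes and function names are our conventions):

```python
import numpy as np

def potts_conditional(J, seq, i):
    """Conditional p(s_i = a | rest) of a generalised Potts model with
    pairwise couplings, the distribution a masked-word predictor targets.

    J   : (N, N, q, q) couplings J_ij(a, b) between sites i, j
    seq : (N,) colour indices of the current configuration
    i   : masked site
    """
    N = seq.shape[0]
    qcol = J.shape[2]
    logits = np.zeros(qcol)
    for j in range(N):
        if j != i:
            logits += J[i, j, :, seq[j]]     # field on site i from site j
    p = np.exp(logits - logits.max())
    return p / p.sum()

def pseudo_loglik(J, seq):
    """Pseudo-likelihood: sum over sites of the log conditional. Maximizing
    this in J is the classical estimator for the inverse Potts problem."""
    return sum(np.log(potts_conditional(J, seq, i)[seq[i]])
               for i in range(len(seq)))
```

Training a masked-language model on data from this distribution amounts to maximizing `pseudo_loglik` over the couplings, which is the equivalence the abstract states.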


A Unified View Between Tensor Hypergraph Neural Networks And Signal Denoising

Wang, Fuli, Pena-Pena, Karelia, Qian, Wei, Arce, Gonzalo R.

arXiv.org Artificial Intelligence

Hypergraph Neural Networks (HyperGNNs) and hypergraph signal denoising (HyperGSD) are two fundamental topics in higher-order network modeling. Understanding the connection between these two domains is particularly useful for designing novel HyperGNNs from a HyperGSD perspective, and vice versa. In particular, the tensor-hypergraph convolutional network (T-HGCN) has emerged as a powerful architecture for preserving higher-order interactions on hypergraphs, and this work shows an equivalence relation between a HyperGSD problem and the T-HGCN. Inspired by this intriguing result, we further design a tensor-hypergraph iterative network (T-HGIN) based on the HyperGSD problem, which takes advantage of a multi-step updating scheme in every single layer. Numerical experiments are conducted to show the promising applications of the proposed T-HGIN approach.
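The multi-step scheme can be sketched as unrolling a few iterations of a signal-denoising fixed-point update inside a single layer. Here an ordinary row-normalized adjacency stands in for the hypergraph tensor operator, so this is a shape-of-the-idea sketch rather than the T-HGIN architecture itself:

```python
import numpy as np

def multi_step_layer(A, x0, alpha=0.5, steps=3):
    """One layer that unrolls `steps` iterations of the denoising update
        x^{k+1} = (1 - alpha) * x^0 + alpha * A @ x^k,
    which trades off fidelity to the input signal x^0 against smoothness
    under the propagation operator A.

    A  : (n, n) row-normalized propagation operator (hypergraph tensor
         operator in T-HGIN; plain adjacency here)
    x0 : (n, c) input node signals
    """
    x = x0
    for _ in range(steps):
        x = (1 - alpha) * x0 + alpha * (A @ x)
    return x
```

A standard single-step HyperGNN layer corresponds to `steps=1`; unrolling more steps per layer lets one layer move further toward the denoising fixed point without stacking (and training) additional layers.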